(b) and (c), respectively. It can be seen that the data presented in
Figure 2.29 (a) may have the worst clustering performance and the data
presented in Figure 2.29 (c) may have the best clustering performance.
From the comparison between the three panels, it can be seen that the
larger the between-cluster distance, the better the discrimination power
and the clustering performance.
Figure 2.29 (panels (a), (b) and (c)). Three scenarios to show the impact
of the between-cluster variance on the clustering performance. The dots
stand for the data points and the triangles stand for the cluster
centres. 'Sb' stands for the between-cluster sum of squares.
The outputs of the kmeans function include several sums of squares.
The output named withinss is a vector and is called the within-cluster
sum of squares. Each entry of the vector is the within-cluster variance
of one cluster, that is, the sum of the squared distances between the
centre of the cluster and all data points which have been classified
into the cluster. The output named tot.withinss is the sum of withinss
and stands for the total within-cluster variance. It is denoted by
$S_W^2$. The output named betweenss stands for the sum of the squared
distances between the cluster centres and the overall mean of the data,
weighted by cluster size, and is called the between-cluster variance.
It is denoted by $S_B^2$.
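The sums of squares described above can be recomputed by hand for a fixed assignment of points to clusters. The following is a minimal Python sketch, not the R implementation: the helper names (`centroid`, `sq_dist`, `sums_of_squares`, `two_clusters`) and the toy data are hypothetical, and the between-cluster term is assumed to be the size-weighted squared distance of each centre to the overall mean, so that the total sum of squares equals the within-cluster part plus the between-cluster part.

```python
# Illustrative sketch only: recompute the quantities that correspond to the
# withinss, tot.withinss and betweenss outputs for a given set of labels.

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sums_of_squares(points, labels):
    """Return (withinss, tot_withinss, betweenss, totss) for fixed labels."""
    clusters = {}
    for p, k in zip(points, labels):
        clusters.setdefault(k, []).append(p)

    grand_mean = centroid(points)
    withinss = {}
    betweenss = 0.0
    for k, members in clusters.items():
        centre = centroid(members)
        # Within-cluster sum of squares: points versus their own centre.
        withinss[k] = sum(sq_dist(p, centre) for p in members)
        # Between-cluster part (assumption): centre versus overall mean,
        # weighted by the number of points in the cluster.
        betweenss += len(members) * sq_dist(centre, grand_mean)

    tot_withinss = sum(withinss.values())
    totss = sum(sq_dist(p, grand_mean) for p in points)
    return withinss, tot_withinss, betweenss, totss


def two_clusters(gap):
    """Hypothetical toy data: two identical square clusters, `gap` apart."""
    shape = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    points = shape + [(x + gap, y) for x, y in shape]
    labels = [0] * len(shape) + [1] * len(shape)
    return points, labels


if __name__ == "__main__":
    # Three scenarios with increasing separation between the two centres.
    for gap in (1.0, 3.0, 6.0):
        pts, lab = two_clusters(gap)
        w, tw, b, t = sums_of_squares(pts, lab)
        print(f"gap={gap}: tot_withinss={tw:.2f}, betweenss={b:.2f}, totss={t:.2f}")
```

In this toy setting the within-cluster sum of squares is the same for every gap, because the shape of each cluster does not change, while the between-cluster sum of squares grows as the centres move apart. This mirrors the three scenarios in Figure 2.29.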
In summary, two statistics (tot.withinss or $S_W^2$ and betweenss
or $S_B^2$) can be used to assess the performance of a cluster model. Based on
$S_W^2$ and $S_B^2$, the well-known F statistic used in ANOVA can be considered
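for optimising a K-means model structure. The F statistic is defined as

$$F = \frac{S_B^2 / (K - 1)}{S_W^2 / (N - K)},$$

where $K$ is the number of clusters and $N$ is the number of data points,
and for which a p value can be calculated for the significance evaluation,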